Finding regulatory modules through large-scale gene-expression data analysis 
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The use of gene microchips has enabled a rapid accumulation of gene-expression data. One of the 
major challenges of analyzing this data is the diversity, in both size and signal strength, of the various 
modules in the gene regulatory networks of organisms. Based on the Iterative Signature Algorithm 
[Bergmann, S., Ihmels, J. & Barkai, N. (2002) Phys. Rev. E67, 031902], we present an algorithm — 
the Progressive Iterative Signature Algorithm (PISA) — that, by sequentially eliminating modules, 
allows unsupervised identification of both large and small regulatory modules. We applied PISA 
to a large set of yeast gene-expression data, and, using the Gene Ontology database as a reference, 
found that the algorithm is much better able to identify regulatory modules than methods based on 
high-throughput transcription-factor binding experiments or on comparative genomics. 



I. INTRODUCTION 

The introduction of DNA microarray technology has 
made it possible to aquire vast amounts of gene expres- 
sion data, raising the issue of how best to extract infor- 
mation from this data. While basic clustering algorithms 
have been successful at finding genes that are coregu- 
lated for a small, specific set of experimental conditions 
[H [3- 03l - these algorithms are less effective when applied 
to large data sets due to two well-recognized limitations. 
First, standard clustering algorithms assign each gene to 
a single cluster, while many genes in fact belong to mul- 
tiple transcriptional regulons. 0,011, EI- Second, each 
transcriptional regulon may only be active in a few ex- 
periments, and the remai ning experiments will only con- 
tribute to the noise 0, 0, Hlf 

A number of approaches have been proposed to over- 
come one or both of these problems H |E II II H E3 • 
A particularly promising approach, the Signature Algo- 
rithm (SA) was introduced in 2002 by Ihmels et al. [ill] . 
Based on input sets of related genes, SA identifies "tran- 
scription modules" (TMs), i.e. sets of corcgulatcd genes 
along with the sets of conditions for which the genes are 
strongly coregulated. SA is well grounded in the biology 
of gene regulation. Typically, a single transcription fac- 
tor regulates multiple genes; a TM naturally corresponds 
to a set of such genes and the conditions under which the 
transcription factor is active. The authors tested the al- 
gorithm on a large data set for the yeast Saccharomyces 
cerevisiae. By applying SA to various sets of genes that 
were known or believed to be related, they identified a 
large number of TMs. 

Soon after, Bergmann et al. ^3 published the Iterative 
Signature Algorithm (ISA), which uses the output of SA 
as the input for additional runs of SA until a fixed point 
is reached. By applying ISA to random input sets and 
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varying the threshold coefficient tc (sec below), the au- 
thors found almost all the TMs that had been identified 
using SA, as well as a number of new modules. Many of 
these modules proved to be in excellent agreement with 
existing knowledge of yeast gene regulation. 

While ISA can identify many transcriptional regulons 
from gene-expression data, the algorithm has significant 
limitations. The modules found depend strongly on the 
value of a threshold coefficient to used in the algorithm. 
To find all the relevant modules, a large range of thresh- 
old values must be considered, and for each threshold the 
algorithm may find thousands of fixed points, many of 
which arc spurious. While the largest, strongest modules 
are easily identified, among the smaller, weaker modules 
it is a major challenge to identify the real transcriptional 
regulons. Weak modules can even be completely "ab- 
sorbed" by stronger modules. 

The performance limitations of ISA are related to a 
number of algorithmic limitations. The need for a large 
range of thresholds is partially due to the threshold def- 
inition, and the large number of fixed points is due to 
the large positive feedback in the algorithm. The main 
conceptual limitation of ISA, however, is that it only con- 
siders one transcription module at a time. The algorithm 
does not use knowledge of already identified modules to 
help it find new modules, and it may find a strong module 
hundreds of times before it finds a given weak module. 
An even worse case is shown in Fig. ^ When a strong 
and a weak module are coexpressed for a significant frac- 
tion of conditions, it may be impossible to find the weak 
module by itself — ISA will find only a single stable fixed 
point, dominated by the strong module. 

A simple way to ensure that the same module is not 
found repeatedly is to directly subtract the module from 
the expression data, (this approach is used in [Tol]). A 
more robust approach is to require the condition vector, 
i.e. the weighted condition set, of each new transcrip- 
tion module to be orthogonal to the condition vectors 
of all previously found modules. In essence, this proce- 
dure corresponds to successively removing transcription 
factors to reveal smaller and weaker transcription mod- 
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FIG. 1: A toy model with only two transcription modules, 
(a) Module 1 is upregulated under condition A, while module 
2, a larger, stronger module, is upregulated under conditions 
A and B. (b) Normalized histograms of the gene scores given 
by the Signature Algorithm (SA) for the background (solid 
fill), module 1 (solid line) and module 2 (dotted fill), when 
using the true condition vector for either module 1 (condition 
A) or module 2 (conditions A+B). Even starting with the true 
condition vectors, SA does not resolve the two modules. Nor 
can the Iterated Signature Algorithm (ISA) resolve module 1, 
even if it receives the module itself as input gene set, as the 
genes from module 2 have higher scores also for condition A 
(there is only one fixed point of ISA). Due to the background 
noise, it is also impossible to separate the modules by varying 



II. METHODS 
A. The Algorithms SA/ISA 

We briefly review the algorithms SA and ISA. A tran- 
scription module M can be specified by a condition vector 
("experiment signature") m c and a gene vector ("gene 
signature") m G , where nonzero entries in the vectors in- 
dicate conditions/genes that belong to the transcription 
module (TM). 

Given an appropriately normalized 1 matrix E of log- 
ratio gene expression data and an input set G\ of genes, 
SA scores all the conditions in the data set according to 
how much each condition upregulates the genes in the 
input set (downregulation gives a negative score). The 
result is a condition-score vector s c : 



E T m G 



(1) 



where E T is the transpose of E and 
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FIG. 2: Once the Progressive Itererative Signature Algorithm 
(PISA) has eliminated the combined module 1+2 from Fig.0 
(dashed line), the remaining signal makes it easy to separate 
the genes of module 1 from the genes of module 2. (a) Re- 
maining signal for each module, (b) Actual gene scores for 
the new fixed point found by PISA. Genes of module 1 (solid 
line) have been separated from genes of module 2 (dotted fill) 
and the background (solid fill). 



ules. The successive removal of condition vectors is the 
central new feature in our approach, and it is illustrated 
schematically in Fig. [3 We call the modified algorithm 
the Progressive Iterative Signature Algorithm (PISA). 
Returning to the example in Figs. and [3 one finds 
that PISA can easily identify both TMs: it first finds the 
strong module, removes its condition vector, and then 
the only signal left is that of the weak module. 

Progressively eliminating transcription modules a la 
PISA can also improve the prospects for finding unre- 
lated modules. The gene regulation from one module 
will contribute to the background noise for all unrelated 
modules. Therefore, eliminating large, strong modules 
can significantly improve the signal to noise ratio of the 
remaining modules. 



(m 



in; a 



(2) 



is the gene vector corresponding to the input set. The 
entries of s c that are above/below a threshold ±tc con- 
stitute the condition vector m c : 



(m c ) c ^(s^) c .e(|(s^) c |-i c ), 



(3) 



where Q(x) = 1 for x > and 0(x) = for x < 0. 

Similarly, the gene-score vector s G measures how much 
each gene is upregulated by the conditions in m c , using 
the entries of m c as weights: 



E m c 



(4) 



The entries of the gene-score vector s G that are more 
than £q standard deviations a gG above the mean gene 
score in the vector s G constitute the gene vector m G : 

(m G ) 9 = (s G ) s • 6((s G ) 9 - ((s G ) 9 ) s - t G a sG ) (5) 

ISA uses m G as the input for the next iteration, 
i.e. the genes are now weighted according to their gene 
scores, until a fixed point is reached. 



B. The Algorithm PISA 

Orthogonalization. Within PISA, each condition-score 
vector s is required to be orthogonal to the condition- 
score vectors of all previously found transcription 



1 S A actually uses two matrices with different normalizations 111! . 



modules (TMs). Therefore, whenever PISA finds ; 
TM and its associated condition-score vector s c , thi 
component along s c of each gene is removed from th< 
gene expression matrix (see Implementation of PIS/ 
below). This requirement of orthogonality in PIS^ 
conflicts with the condition-score threshold as used ii 
ISA. If we make the condition-score vector orthogona 
first and then apply the threshold, the vector will n< 
longer be orthogonal, whereas if we apply the threshok 
first, orthogonalization will give nonzero weight t< 
all conditions, eliminating the noise-filtering benefit o 
thresholding. We have chosen to eliminate the condition 
score threshold completely. In any event, condition 
that in ISA would fall below the threshold will have lov 
weight and will give only a small contribution to th< 
noise. 

The gene-score threshold. In ISA, to find all modules, 
it is necessary to run the algorithm with many different 
threshold coefficients tQ. Fot low thresholds one finds 
a few very large modules (many genes), while for high 
thresholds one finds many small modules (few genes). 
Without prior knowledge of the module one is searching 
for, it is difficult to know what tQ to use. Within PISA, 
we wish to find all the modules using a single threshold. 
This requires modifying the threshold definition. In ISA, 
the gene-score threshold is taa lsA where the standard 
deviation c ISA is computed using the full distribution 
of gene scores, and includes contributions both from the 
background and from the module of interest (Fig.[3J|. For 
large, strong modules, the module contribution may be 
larger than the background contribution. As a result, 
(T ISA is module dependent and tQ must be adjusted to 
prevent false-positives from the background. 

We eliminate this problem in PISA by specifying the 
threshold relative to the background, which we estimate 
using the mean, (cc} 70< ^, and the standard deviation, 
^ f gene scores within the shortest interval that 
contains at least 70% of all the gene scores. By excluding 
extreme gene scores in this way, we minimize the influ- 
ence of the module on the means and standard deviations 
of gene scores (Fig. |3J|. As a test, we used a 70% in place 
of er ISA in ISA and found both very large and very small 
modules with a single value of tQ. 

We need to be conservative when selecting the gene- 
score threshold because, if PISA misidentifies a module, 
elimination of its condition vector can lead to errors in 
other modules. Therefore, the number of genes included 
in modules due to noise should be very low. We have used 
a threshold of 7.0cr 70% , which for a Gaussian distribution 
corresponds to about 3.9a. The chance of including a 
gene due to noise is about 10 -4 per gene, e.g. with the 
6206 genes in the yeast data set, the average number 
of genes included by mistake in each module would be 
about 0.62. Using a high threshold means that we may 
miss genes that should belong to a module, however this 
is less risky than including genes by mistake. As PISA 
proceeds by eliminating condition-score vectors, it does 
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FIG. 3: Means and standard deviations as used in ISA and in 
PISA algorithms, calculated using all the genes (top bars) or 
only the non-module genes (bottom bars). The mean (x) 70% 
and standard deviation a 70% from PISA, using only the dis- 
tribution within the shortest interval that contains 70% of 
all genes, are almost identical to the ideal values (x) hg and 
0.56<r bg of the background noise (non- module genes). In ISA, 
the mean and standard deviation are calculated from the 
whole distribution and so are strongly module dependent. 
This example uses generated data for a module of 300 genes 
out of 6206 total genes. The non-module genes have a normal 
distribution of gene-expression levels. 



not matter whether we identify all the genes in a module, 
as long as the condition-score vector is accurate. Once, 
PISA has finished, we can easily see which genes would 
be included when using various gene-score thresholds for 
the same condition-score vector. 

ISA only considers sets of genes that have high gene 
scores, i.e. positive signs. As discussed in this can 
lead to two modules that are regulated by the same 
conditions but with opposite sign. In contrast, PISA 
includes all genes with sufficiently extreme scores in 
a single module, and the relative signs of gene scores 
specify whether the genes are coregulatcd or counter- 
regulated. 

Implementation of PISA. To begin, PISA requires a ma- 
trix E of log-ratio gene-expression data, with zero average 
for each condition. Two matrices are obtained from E: 
The first Eq is normalized for each gene 

((E G ) ffC ) c = 0, ((E G )2 c ) c = l V.geG. 

Normalization of Eq is essential so that the gene-score 
threshold can be applied to all genes on an equal footing. 
The second matrix Ec is obtained from Eq by normal- 
izing for each condition, ((Ec,o)g C )g = b where Ec,o 
denotes the initial Ec- (Note that this is essentially the 
opposite of the notation used in We then apply a 

modified version of ISA, mISA (see below), a large num- 
ber of times (typically 10,000), and whenever mISA finds 
a module, we remove from Ec the components along the 
module's condition score vector s c : 
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As mISA is repeatedly applied, new modules arc 
found less and less frequently. For example, one run of 
10,000 applications of mISA found 496 modules, and 
287 of them were found in the first 1,000 applications. 
As the later modules are also generally smaller and less 
reliable, the exact number of times mISA is applied is 
not very important. 

mIS A. As input, the modified Iterative Signature Algo- 
rithm (mIS A) requires the two matrices Ec and Eq- We 
start each application of mISA by generating a random 
set of genes Go and a corresponding gene vector mp : 

fm G ) - { 1 geGo 
[m ° j » _ \ g(£G 

Each iteration i within mISA consists of multiplying the 
transpose of Ec by the gene vector mp to produce the 
condition-score vector sp: 

sp ee E^mp, 

and then multiplying Eo by the normalized condition- 
score vector to produce the gene-score vector sf : 

„G _ Eosf 

i _ |_C| ' 
\ i I 

From sp, one calculates the gene vector mp^ for the 
next iteration: 

(m? +1 ) 9 ee ( B f), 6(\(s?) g - ({sf)g) 7 g 0% \ - t G alf) 

We iterate until: (a) (mp) g and (mf +1 ) g have the same 
sign (0, + or -) for all g, (b) the iteration number is 
i = 20, or (c) fewer than two genes have nonzero weight. 
If fewer than five genes have nonzero weight (for (a) or 
(b)), the result is discarded, otherwise we have found a 
module with condition-score vector s c = sp, gene-score 
vector s G = sp, and gene vector m G = mf^. 

We chose a threshold coefficient t& = 7.0 so that the 
expected number of genes included in each module due 
to background noise would be less than one. However, 
with this high threshold, starting from a random set 
of genes there was only a very low chance that two or 
more genes would score above the threshold in the first 
iteration 2 . To increase the chance of finding a module, 
we used a different formula for mp. Instead of selecting 
only genes with scores above the threshold, we kept a 
random number 2 < n < 51 of the genes with the most 
extreme scores. This procedure was generally adequate 
to produce a correlated set of genes for the next iteration. 



2 This is not an issue in ISA, where the condition threshold helps 
to pick out the, possibly very small, signal from the noise. 



Consistent modules. ISA typically finds many different 
fixed points corresponding to the same module, each 
differing by a few genes. PISA only finds each module 
once during a run, but the precise genes in the module 
depend on the random input set of genes and also 
on which modules were already found and eliminated. 
Furthermore, PISA sometimes finds a module by itself, 
while other times it may find the module joined with 
another module, or PISA may find only part of a module, 
or not find the module at all. To get a reliable set of 
modules, it was necessary to perform a number of runs 
of PISA and identify the modules that were consistent 
from run to run. To identify consistent modules, we first 
tabulated preliminary modules - transcription modules 
found by individual runs of PISA. A preliminary module 
contributes to a consistent module if the preliminary 
module contains more than half the genes in the full 
module, regardless of gene-score sign, and these genes 
constitute at least 20% of the genes in the preliminary 
module. A gene is included in the consistent module if 
the gene occurs in more than 50% of the contributing 
preliminary modules, always with the same gene-score 
sign. 

Correlations between condition-score vectors. Once we 
identified a consistent module, m G , we calculated the 
raw condition-score vector r = Ej m G , using the initial 
value of the gene-expression-data matrix Ec- From the 
r's we evaluated the condition correlations r • r'/(|r| |r'|) 
between different modules. 

Additional details arc discussed in the supporting mate- 
rial. 



C. p- Values 

Given a set containing m genes out of the total of Nq, 
the p- value for having at least n genes in common with 
a Gene Ontology (GO) category containing c of the Nq 
genes is 

minjc.m} / c \ /ATq— c\ 

( N G) ■ ( 7 ) 
i—n \ m J 

We ignore any genes that are not present in our expres- 
sion data when counting c. 



III. RESULTS 

We applied PISA to the yeast data set used in [T^ . 
which consists of log-ratio gene-expression data for Nq = 
6206 genes and Nq = 1011 experimental conditions (ap- 
proximately 10% of the values are missing or invalid). 
Normalization gives the matrices Eq and Ec (see Meth- 
ods for details). 
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As a preliminary test, we repeatedly applied PISA to 
one fully scrambled version of the matrix Eq (and the 
corresponding Ec). From run to run, the algorithm iden- 
tified many large modules derived almost entirely from a 
single condition, as expected in light of the broad distri- 
bution of the raw gene-expression data (Fig. EU in sup- 
porting material). PISA also found many small mod- 
ules, but these differed from one run to the next. We 
were able to eliminate both these classes of false positives 
using filters for consistency, recurrence, and number of 
contributing conditions (Fig. [32] in supporting material). 

We performed 30 runs of PISA on the yeast data set 
and identified the modules that appeared consistently, 
using the filters derived above. At the start of each run, 
only a few modules could be found with our single choice 
of gene threshold tc- Nevertheless, PISA did consistently 
find new modules after eliminating others, demonstrating 
that removing the condition vectors of found modules 
improves the signal to noise for the remaining ones. 

For most of the modules we found, the genes were 
corcgulatcd, i.e. all the gene scores had the same sign. 
(In contrast, the modules that were eliminated by the 
filters often had about equal numbers of genes of ci- 
ther sign.) There were, however, a significant number 
of modules with a few gene scores differing in sign from 
the rest, and a few modules with many gene scores of 
both signs, e.g. the a/a pheromone production/detection 
module. Furthermore, many of the modules found by 
PISA agreed closely with modules identified by ISA at 
various thresholds, while other PISA modules were sub- 
sets of ISA modules. Some PISA modules, for example 
the dc novo purine synthesis module (Fig. EJ, were sig- 
nificantly more complete than the ones found by ISA (at 
any threshold). 

PISA found several small modules that agree very well 
with known gene regulation in yeast. For example, the 
arginine-biosynthesis module consists of ARG1, ARG3, 
ARG5,6, ARG8, CPA1, CTF13, and CAR2; out of these 
CAR2 has a negative gene score, i.e. it is counter- 
regulated relative to the others. The first five genes 
are precisely the arginine-synthesis genes known to be 
repressed by arginine, while CAR2 and CAR1 (which 
is the 2nd highest scoring gene that failed to make the 
threshold) are catabolic genes known to be induced by 
arginine jlq . 

PISA also found a zinc (zapl regulated) module con- 
sisting of ZRTI, ZRT2, ZRT3, ZAP1, YOLI54, INOl, 
ADH4, and YNL254C. These are almost exactly the high- 
est scoring genes in a microarray experiment compar- 
ing expression under zinc starvation of a zapl mutant 
vs. wild type |l5j . however, our data set docs not in- 
clude this or any other zinc starvation (or zapl mutant) 
experiment — indeed, there are no experimental condi- 
tions that have a remarkably high score for this module, 
although conditions from the Rosetta compendium |l6j |. 
most of which arc deletion mutant experiments, tend to 
have much higher scores than the other conditions (see 
supporting material). This module, as well as the star- 
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FIG. 4: The de novo purine synthesis module found with 
PISA. 
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FIG. 5: Correlations between modules identified by PISA. 
The modules are ordered to form clusters: In the lower left 
corner are the main amino acid biosynthesis module and sev- 
eral smaller, more specific biosynthesis modules; the other 
main clusters are, roughly, stress response, mating, and ribo- 
somal proteins/rRNA processing. 
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(a) PISA vs. database A (b) PISA vs. database B 
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FIG. 6: Best p-values onto every Gene Ontology (GO) cate- 
gory. In each panel, we include only GO categories for which 
at least one p- value is below 10~ 10 . (a) PISA vs. Database A. 
(b) PISA vs. Database B. (a) inset: Database A vs. database 
B — there are very few GO categories onto which both A and 
B have low p-values. 



vation experiments in [ 1 5| and direct transcription fac- 
tor binding experiments (see below), all indicate that 
YNL254C is regulated by zapl, and it probably has some 
function related to zinc starvation/ uptake. 

In order to evaluate the overall performance of PISA, 
we compared our modules to the c ateg ories in the Gene 
Ontology (GO) curated database ^21 • For the set of 
genes in each of our modules we calculated the p-value 
for the overlap with the set of genes in every GO cate- 
gory (see Methods). The p- value is the probability that 
an observed overlap occurred by chance. The lowest p- 
value we found was 3.5 • 1CP 216 , for the GO category 
"cytosolic ribosome" , and we found p- values below 10~ 20 
for more than 140 other GO categories. (The modules 
that were removed by our filters mostly did not have sig- 
nificant p-valucs, and none were below 10 -10 ). We used 
the p- values between our PISA modules and the GO cat- 
egories to compare PISA to other means of identifying 
transcriptional modules. Specifically, we compared PISA 
to two different databases of genes predicted to be regu- 
lated by single transcription factors. Database "A" con- 
tains genes that were enriched through immunoprecipi- 
tation with tagged transcriptional regulators [l3j, while 
Database "B" has genes sharing regulatory sequences de- 
rived by comparative genomics [l4T |. Figure shows the 
p- values between GO and PISA compared to the p- values 
between GO and each of these two databases. 3 The lower 
p-values for PISA indicate a consistently better agree- 
ment between GO and PISA than between GO and the 
other databases. For a few GO categories Database B 
has a lower p-value than PISA, but these categories are 
all close to the root of the GO tree and each contains 
more than half the genes in yeast. 

Compared to microarray data, both Database A and 



3 We used an int erna l p-value threshold of 0.001 for Database A, 
as suggested in ll.'l . 



Database B have a clear disadvantage: their binding 
sites are assigned to intergenic regions, and if the two 
genes bordering an intergenic region arc divergently tran- 
scribed, then the databases do not identify which of the 
genes is regulated. In many cases, we found that by com- 
paring sets of genes in database A to PISA modules, we 
could decide which of divergently transcribed genes were 
actually regulated. For example, Database A lists 6 in- 
tergenic regions as binding site for zapl at an internal 
p- value threshold of 10~ 5 , and 4 of these lie between 
divergently transcribed genes. However, 5 of the 6 in- 
tergenic regions border the genes ZRT1, ZRT2, ZRT3, 
ZAP1, and YNL254C which PISA identifies as part of 
the zinc module. 

Database A appears to have an additional source of 
false positives. Intergenic regions that are close to in- 
tergenic regions with very low p-values often have low 
p- values themselves, even when there is no apparent con- 
nection between the genes and no evidence of a binding 
site in the DNA sequence. For example, for the de novo 
purine-biosynthesis module, which is primarily regulated 
by the basl transcription factor, the intergenic region 
controlling GCV2 has the lowest p- value within Database 
A, 1.1 • 10^ 16 , and all the four closest intergenic regions 
have p- values below 10 -5 . Comparison to PISA modules 
can help eliminate these potential false positives: out of 
the 29 genes assigned a p- value below 10 -4 for basl bind- 
ing in database A, 13 belong to a single PISA module, 
4 others are divergently transcribed adjacent genes, and 
6 others are genes transcribed from nearby intergenic re- 
gions. 



IV. DISCUSSION 

The Progressive Iterative Signature Algorithm (PISA) 
embodies a new approach to analysis of large gene- 
expression data sets. The central new feature in PISA 
is the robust elimination of transcription modules as 
they are found, by removing their condition-score vec- 
tors. Also new to PISA, compared to its precursors SA 
[ill and ISA H3, is the inclusion of both coregulated and 
counter-regulated genes in a single module, and the use 
of a single gene-score threshold. 

Altogether, these new features result in an algorithm 
that can reliably identify both large and small regulatory 
modules, without supervision. We confirmed the per- 
formance of PISA by comparison to the Gene Ontology 
(GO) database - PISA performed considerably better 
against GO than either high-throughput binding experi- 
ments or comparative genomics. PISA therefore provides 
a practical means to identify new regulatory modules and 
to add new genes to known modules. 

Can PISA shed any light on the organization of gene 
expression beyond the level of individual transcription 
modules? In the authors argued that they could 
trace the relationship between modules from the effects 
of changing the threshold tQ. F° r instance, a large 
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module might split into two smaller ones as £q was in- 
creased. With PISA, we were able to use a more di- 
rect approach. Once we identified the modules, we com- 
puted the "raw" (i.e. pre-eliminations) condition-score 
vector r for each module, and from these raw condition- 
score vectors, we evaluated the condition correlations be- 
tween modules (see Methods). Figure shows the con- 
dition correlations between 40 of the modules that we 
can put a name to. A large, positive correlation between 
two modules can either indicate that the modules have 
many genes in common, e.g. the genes of the arginine- 
biosynthesis module are essentially a subset of the genes 
of the amino-acid-biosynthesis module, or, as in the toy 
model in Figs. ^ and |3 the modules have few/no genes 
in common, but the two sets of genes are similarly regu- 
lated under many conditions. In the toy model, the raw 
condition-score vectors ri and r2 correspond to the vec- 
tors in Fig.QJa) and their correlation, ri • r2/(|i"i| |ra | ) , 
is simply the cosine of the angle between them. A real 



example of this second type of correlation is provided by 
the ribosomal-protein module (104 genes) and the rRNA- 
processing module (144 genes). They have no genes in 
common, but the correlation between them is very high, 
0.76. 

Out of the 6206 genes included in the expression data, 
2626 genes appeared in at least one module, and 923 
genes appeared in more than one module 4 . No genes 
appeared in more than 4 different modules. 
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. SUPPORTING MATERIAL 
A. Normalization 

Here we review in detail the normalization procedure 
employed in PISA. The most obvious requirement for the 
normalization is that scores for different genes must be 
comparable. The procedure itself is as follows: Given a 
matrix E of log-ratio gene-expression data, we first set 
the average to zero for each condition, 

(E') ff c = (E) gc -((E) s , c ) s , ) (SI) 

and then normalize to zero mean and unit variance for 
each gene, giving E G , which is used in PISA to calculate 
gene scores: 

(E") sc = (E'),c - ((EU_ (S2) 

(E G ) 9C = (E") gc /- V /((E")^)c (S3) 

For this normalization to be consistent through the itera- 
tions in mISA, the different condition scores must also be 
comparable. To get the initial value E G .o of the matrix 
used to calculate condition scores, we divide E G by the 
rms value for each condition: 

(E c ,o) 3C = (E G ) 3C / V /((E G )2 V } 9 ,. (S4) 

Note that a simple approach would be to normalize for 
both genes and conditions simultaneously and thus use 
only a single set of data 5 — this could be easily accom- 
plished by alternately normalizing over conditions and 
genes a few times; the data converge quickly. There is, 
however, a risk of losing significant features of the data 
through excessive normalization. For some conditions, 
the typical change in expression levels may be very large, 
while for others it may be negligible, and it would be 
misleading to always normalize these to the same level; 
at the very least, this would give a lower signal to noise 
ratio. Therefore, we have chosen to normalize E G over 
genes but not conditions, allowing conditions with large 
changes in expression level to make a proportionately 
larger contribution to gene scores. For genes, however, 
it is reasonable to always normalize to the same level. If 
two genes are in the same module, then there is little rea- 
son to consider the gene with the larger dynamical range 
to be more reliable than the other. That is why we use 
E G to calculate Ec,o- 

Also note a the difference between genes and condi- 
tions: The variance for a gene often depends on a small 
number of outlying values, and normalizing over genes 
prevents these from dominating. In contrast, the vari- 
ance for a condition typically depends on many genes, 
and as such is a far more reliable quantity. 



5 If Eq = Ec,0 initially, then it is equivalent to keep Eg constant 
or use Eq = Ec, which is updated every time PISA finds a 
module. 



B. Avoiding Positive Feedback 

The basic principle of SA, or an iteration of ISA/mlSA, 
is to find the set of genes whose expression profiles most 
resemble those of the genes in the input set, either for all 
conditions (mISA) or for a selected subset of conditions 
(SA/ISA). Of course, the gene whose expression profile 
most resembles that of a given gene is the gene itself, 
thus there is a potential for significant positive feedback. 
Adding one gene to the input set would typically increase 
the score of that gene far more than the score of any other 
gene. As a consequence of positive feedback, adding one 
gene to the gene vector of a fixed point would have a 
considerable chance of yielding another fixed point, and 
a small set of genes could be a fixed point even if the 
genes were completely uncorrelated. 

In PISA, we only find each module (or combination) 
once for each run, and it is important to be as certain 
as possible that we have the correct genes. We avoid 
positive feedback by using leave-one-out scoring for genes 
that had nonzero weight at the start of the iteration, 
i.e. we remove the contribution from gene g from the 
condition scores sf before we use these scores to calculate 
the new score for gene g: 



[Si)s ~ |s?-(ET). g ( m p )g | ' 

where (A)j_ is row j of matrix A, and (A)_j is column j 
of matrix A. With a Gaussian distribution of the back- 
ground noise, this approach is very close to neutral, i.e. 
adding a gene will neither affect that gene's score, nor 
will it significantly change er 70% of the gene-score distri- 
bution. 

Without positive feedback, fixed points may be 
marginally stable (or even unstable, i.e. a limit cycle), 
thus we do not require a true fixed point; we accept any 
gene vector reached after 20 iterations in mISA, as long 
as it contains at least 5 genes. 

In SA/ISA, the authors do not eliminate positive 
feedback. Indeed it would be difficult to do so, as 
adding/removing a gene can change which conditions 
have scores exceeding the condition threshold. Apart 
from this complication, the feedback in SA/ISA is pro- 
portional to the number of conditions that make the 
threshold. For small modules, typically only a small frac- 
tion of the conditions have scores above the threshold, 
thus the feedback is lower than it would have been for 
PISA, which includes all conditions. For large modules, 
the feedback is only a minor effect in the first place. Nev- 
ertheless, the total number of fixed points for ISA is huge 
due to positive feedback — at a gene threshold coefficient 
i G = 4.0, there are, at a minimum, more than a million 
fixed points. 
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C. Filters 

We chose the gene-score threshold as 7.0cr 70% so that, 
on average, less than one gene would be included in a 
module purely due to background noise. This estimate 
assumed that the background noise had a Gaussian dis- 
tribution. For most modules, the gene scores are the 
sums of contributions from many different conditions, 
and if these contributions are independent, as they should 
be for background noise, then the total background noise 
will have approximately a Gaussian distribution, regard- 
less of the distribution for a single condition (central limit 
theorem). For modules that derive almost entirely from 
one or very few conditions, however, the distribution of 
gene scores may not be Gaussian. 

While we do not know the true distribution of the back- 
ground noise, it is reasonable to use the full distribution 
of the data as a worst case scenario. As shown in Fig. lSlI 
this distribution is far from Gaussian: it has a fairly 
sharp cusp at zero and long tails, even after normaliza- 
tion. For this distribution, more than 3.5% of the values 
are outside the threshold ±7.0er 70% (this is partially be- 
cause the long tails contain many genes, and partially 
because 

a 70% is 

small due to the sharp cusp) , i. e. with a 
gene-expression matrix randomly drawn from this distri- 
bution, for any single condition one would expect to find 



based on many conditions were much smaller. We also 
applied PISA to a random matrix generated from a Gaus- 
sian distribution, and in that case PISA did not find any 
large modules (in 30 runs, PISA found 8 modules with 20 
or more genes; the largest contained 26 genes). In both 
cases, the small modules found by PISA varied from run 
to run. 

In order to eliminate these false modules we introduced 
a set of filters. For each preliminary module M we cal- 
culate the "number of contributing conditions" , given as 
n M = S c ( s °)c/( max {( sC )c}) 2 - We ignored any mod- 
ule for which the median of the numbers of contribut- 
ing conditions for its preliminary modules was below 6 
(this threshold worked well; it is somewhat above the 
threshold required to remove the false positives for the 
scrambled matrix). We also ignored all modules that had 
fewer than 5 genes or fewer than 5 contributing prelimi- 
nary modules, and for modules with fewer than 10 genes 
we required that the "consistency" , defined as the aver- 
age fraction of the genes in the preliminary modules that 
are in the full module, was above 0.55 (during post pro- 
cessing, we required that this fraction was above 0.2 for 
each preliminary module). These filters removed all but 
one of the modules found by PISA when applied to the 
scrambled matrix. 



Module characteristics 



(a) Raw expression data 



(b) Normalized expression data 
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FIG. SI: Distributions of the yeast microarray data used 
(6206 genes/ORFs, 1011 conditions). Roughly 10% of the 
data was invalid/missing (not included in the distributions). 
The distribution is sharply cusped and has long tails, both 
before and after normalization fEas. ISlllS3(l . 

We applied PISA to a matrix Eq that had been fully 
scrambled after normalization 6 . As shown in Fig. IS2I 
PISA found many large modules that were based almost 
entirely on a single condition (however, as the modules 
were not based on only one condition, they were not as 
large as our estimate of 200, above), whereas modules 
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FIG. S2: The number of genes in a module M and the 
number of contributing conditions n^j (see text) were two of 
the properties we used in our filters to eliminate false modules. 
PISA applied to a scrambled expression matrix (black) only 
yielded modules close to the axes (small or small n^), 
while PISA run on the real data (green) yielded modules with 
both large and large n£j. 
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# 


# 




Over. 


Best 




Function 


genes 


cond. 


Cons. 


w/ISA 


*G 


Freq. 


Amino acid biosynthesis 


96 


31.2 


0.83 


0.89 


3.7 


10090 


Arginine biosynthesis 


6 


5.7 


0.72 


0.83 


6.0 


60 


Biotin synthesis & transport 


6 


6.5 


0.80 


0.67 


5.5 


7 


Lysine biosynthesis 


11 


9.0 


0.82 


0.82 


4.6 


10 


De novo purine biosynthesis 


32 


13.1 


0.83 


0.59 


5.0 


16 


Oxidative stress response 


69 


23.8 


0.91 


0.32 


3.4 


(1) 


Aryl alcohol dehydrogenases 


6 


15.4 


0.62 


0.83 


4.9 


8 


Proteolysis 


27 


82.1 


0.80 


0.86 


3.6 


1661 


Trehalose & hexose metabolism/conversion 


21 


34.9 


0.55 


0.67 


3.2 


910 


COS genes 


11 


9.2 


0.49 


1.00 


3.3 


756 


Heat shock 


52 


42.8 


0.78 


0.38 


3.2 


(1) 


Repair of disulphide bonds 


26 


41.6 


0.73 


0.58 


3.5 


15 


Calcium-calmodulin related 


41 


32.5 


0.78 


0.73 


3.0 


2198 


Oxidative phosphorylation 


42 


48.3 


0.89 


0.95 


3.7 


2600 


Gluconeogenesis, fatty acid beta-oxidation 


38 


18.2 


0.81 


0.63 


2.9 


264 


Mitochondrial ribosomal genes 


52 


57.6 


0.79 


0.89 


3.3 


2291 


Transcription (RNA polymerase etc.)++ 


22 


70.4 


0.59 


0.52 


3.2 


1 


Subtelomerically-encoded proteins 


36 


48.2 


0.94 


1.00 


3.9 


6174 


Iron/copper uptake 


38 


10.8 


0.82 


0.79 


3.7 


1704 


Coated vesicles/secretion 


25 


47.6 


0.61 


0.64 


3.7 


4 


Phosphoglycerides biosynthesis 


33 


36.1 


0.86 


0.61 


2.9 


27 


Hexose transporters 


10 


33.9 


0.74 


0.60 


3.8 


41 


Galactose utilization 


23 


17.4 


0.84 


0.74 


3.2 


686 


Mid sporulation 


97 


11.7 


0.90 


0.70 


2.7 


6556 


Mating factors/receptors: a/a difference 


26 


15.8 


0.57 


0.58 


3.8 


6 


Mating 


110 


31.1 


0.89 


0.75 


2.7 


24622 


Mating type a signaling genes 


6 


18.6 


0.26 


0.83 


5.5 


22 


Mating genes for mating type a 


15 


13.6 


0.41 


0.53 


8.0 


16 


Phosphate utilization 


27 


24.4 


0.89 


0.81 


3.3 


5796 


Glycolysis 


19 


26.9 


0.54 


0.89 


3.7 


91 


Ergosterol biosynthesis 


36 


28.3 


0.89 


0.69 


3.1 


57 


Cell cycle Gl/S 


66 


39.1 


0.80 


0.81 


3.7 


4382 


Cell wall (bud emergence) 


17 


42.7 


0.76 


0.94 


4.0 


63 


Cell cycle M/Gl 


35 


31.4 


0.82 


0.89 


3.9 


952 


Cell cycle G2/M 


31 


25.0 


0.82 


0.90 


3.7 


1258 


Uracil synthesis/permeases 


8 


11.4 


0.75 


0.88 


3.5 


19 


Fatty acid synthesis++ 


22 


49.4 


0.86 


0.50 


3.1 


2 


Histones 


19 


34.6 


0.67 


0.53 


3.4 


2972 


Ribosomal proteins 


126 


49.2 


0.91 


0.87 


3.0 


18661 


rRNA processing 


117 


46.0 


0.85 


0.64 


2.7 


13355 



TABLE I: 40 of the modules found by PISA that we could assign a name to. For each module we list the number of genes in 
the module, the number of conditions that had a significant contribution to the module, how consistent the module was from 
each run to the next, the maximal overlap with a module found by ISA (using 200,000 seeds at each threshold from 1.8 to 
15.0), the threshold value tG at which that overlap was found, and how many times such an ISA module was found. 
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Module: Galactose induced genes 
Number of genes: 23 

Average number of contributing conditions: 18.1 
Consistency: 0.84 

Best ISA overlap: 0.74 at threshold 3.2, frequency 
686 



GAL10 


GAL7 


GALl 


GAL3 


GAL2 


YPL066W 
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■ Hcxose transporters (downrcgulated) 
3 I Other, downregulatcd 



Other 

Raw condition scores 




FIG. S3: The galactose induced module found with PISA. 
This module turns on GAL genes and also, as a weaker effect, 
represses a number of hexose transporters. 
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Module: Hexose transporters 
Number of genes: 10 

Average number of contributing conditions: 33.7 
Consistency: 0.74 

Best ISA overlap: 0.6 at threshold 3.8, frequency 41 



HXT3 


HXT4 


HXT2 


HXT6 


HXT7 


GAL2 












YKR075C 


HXT1 
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Glucose transporter 
Galactose/glucose transporter 
Glucose suppression regulator 
Similar to glucose suppression regulator 

Raw condition scores 
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FIG. S4: The hexose transporter module found with PISA. 
In this module (which is consistently found after the galac- 
tose induced module), the hexose transporter genes are co- 
regulated with GAL2, the galactose permease, whereas they 
were counter-regulated in the galactose induced module. 
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Module: Peroxide shock 
Number of genes: 69 

Average number of contributing conditions: 23.9 
Consistency: 0.91 

Best ISA overlap: 0.34 at threshold 3.4, frequency (1) 




YKL071W 
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FIG. S5: The oxidative stress response module found with 
PISA. This module is significantly more complete than the 
modules of comparable size found by ISA. 
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Module: Zinc regulated genes 
Number of genes: 8 

Average number of contributing conditions: 29.0683 
Consistency: 0.638515 

Best ISA overlap: 0.88 at threshold 4.6, frequency 2 
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FIG. S6: The zinc module found with PISA. This module 
has a high overlap with the group of genes bound by ZAPl 
in database A (at p-value 0.001): The ZRTl, ZRT2, ZRT3, 
ZAPl and YNL254C genes make up 5 of the 6 lowest p- 
values (counting each pair of divergently transcribed genes 
only once), and the remaining hits from database A (most 
with p- values above 10 -4 ) are likely to be mostly false posi- 
tives. Based on this, it seems very likely than YNL254C, if 
functional, is regulated by and related to zinc. (ADH4 has 
also been shown to be zinc-regulated elsewhere.) 
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Module: Arginine regulation 
Number of genes: 7 

Average number of contributing conditions: 14.0667 
Consistency: 0.548283 

Best ISA overlap: 0.71 at threshold 6.0, frequency 
60 
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FIG. S7: The arginine regulated module found with PISA. 
The module agrees very well with what is known about 
regulation of arginine metabolism [F. Mesenguy and E. 
Dubois (2000) Food tech. bio. 38, 277-285]: ARGl, 
ARG3, ARG5,6 and ARG8 are repressed by arginine through 
the Arg80/Arg81/Mcml complex, while CAR2 (and CAR1, 
which is the 2nd highest scoring gene that failed to make 
the module) is activated by the same complex. We also find 
CPAl, which is claimed to be regulated by arginine at the 
translational level — the mRNA is destabilized by a small pep- 
tide in the presence of arginine. However, database A indi- 
cates that ARGl, ARG3, ARG5,6, ARG8 and CPAl are all 
bound by the Arg80/Arg81/Mcml complex. 



